Red Wine Data Exploration by Simone Romero

This project is part of the “Explore and Summarize Data” module from Udacity’s Data Scientist Nanodegree Program.

The data set chosen for this project is Red Wine Quality, which is publicly available for research. More details are given in:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

The exploratory analysis will be guided by the following question: Which chemical properties influence the quality of red wines?

Dataset overview

The dataset variables are:

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

The variable types are:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The dataset mixes discrete variables (X and quality are integers) with continuous ones. The X variable is just an index for each observation, so let’s remove it.

red_wine <- within(red_wine, rm(X))

Let’s see the distribution of our variables in the dataset.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The variables fixed.acidity, volatile.acidity, citric.acid, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide show high dispersion: in each case the maximum lies far above the third quartile, which suggests the presence of outliers.

Wine quality ratings range from 3 to 8, with a median of 6.

Univariate Plots Section

As a first step in the univariate analysis, let’s plot some histograms to understand the structure of the individual variables in the dataset.

The density and pH plots show an approximately normal distribution, while citric.acid, free.sulfur.dioxide, and total.sulfur.dioxide are right-skewed. Outliers are visible mainly in the residual.sugar and chlorides plots.

Removing outliers

Let’s remove the outliers from the residual.sugar and chlorides features to make the plots easier to read.

The original histogram and the summary for the residual.sugar feature are the following:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

According to this result, the interquartile range is IQR = Q3 - Q1 = 2.6 - 1.9 = 0.7. This value can be used to set the histogram breaks: data points outside the lower and upper fences are dropped, so the outliers are ignored.

The plots above show the histogram with a log scale only, and with a log scale plus IQR-based breaks.

The same approach was applied to the chlorides feature. Below are the histogram and summary for this feature.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

In this case, the IQR is 0.02, and the lower and upper fences are 0.04 and 0.12, respectively.
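The fences can be recomputed from the reported quartiles as a quick check. This is a minimal sketch, assuming the conventional 1.5 × IQR rule (it reproduces the chlorides fences quoted above); the helper name `iqr_fences` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: fences from the first and third quartiles.
# The multiplier k = 1.5 is an assumption; it matches the chlorides
# fences reported in the text.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, iqr = iqr, upper = q3 + k * iqr)
}

# Quartiles copied from the two summaries shown in this section
sugar_fences    <- iqr_fences(q1 = 1.9,  q3 = 2.6)
chloride_fences <- iqr_fences(q1 = 0.07, q3 = 0.09)

print(round(sugar_fences, 2))
print(round(chloride_fences, 2))
```

Under the same rule, the residual.sugar fences come out at about 0.85 and 3.65.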

Scaling the data and removing the outliers provides a better visualization of the data distribution.

Creating a new variable

Given the summary and the plot of the quality feature, we can observe that most observations are rated 5 or 6, around the median. Few wines were rated 3 or 4 (low quality) or 7 or 8 (high quality). Based on that, we decided to group the data into 3 categories: low, average, and high.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

According to the summary of the quality feature, Q1 = 5, Q2 = 6, and Q3 = 6. Observations with a quality rating lower than 5 were classified as low, those with a rating of 5 or 6 as average (closest to the median), and those with a rating of 7 or higher as high, as shown in the plot below.

##     low average    high 
##      63    1319     217
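The grouping described above can be sketched with base R's `cut()`. This is an illustration rather than the original code: the break points follow the rule stated in the text, and the short `quality` vector holds only the first ten scores from the structure listing.

```r
# First ten quality scores, copied from the str() output shown earlier
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf]
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("low", "average", "high"),
              ordered_result = TRUE)

# Count of observations per category
print(table(rating))
```

Making the factor ordered keeps the levels in low, average, high order in tables and plots.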

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1,599 observations of red wines. The original dataset has 12 features: 11 chemical properties and the score given by the experts, named quality.

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality since it represents the experts’ opinion about the wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and alcohol (percent by volume) are the features most likely to support the investigation, since they contribute most to the smell and taste of wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created the ‘rating’ variable, which is a categorical representation of wine quality: low (3,4), average (5,6), and high (7,8).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

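I removed the X variable, which was just the dataset index. For residual.sugar and chlorides, whose histograms showed outliers, I used a log scale and IQR-based breaks to make the distributions easier to read.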

Bivariate Plots Section

In this section, we are going to explore the following features: volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and alcohol.

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5650  0.6800  0.7242  0.8825  1.5800 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150

The boxplots show that wine quality increases as volatile.acidity decreases. The medians for low- and high-quality wines differ markedly: the median volatile.acidity for high-quality wines (0.37) is roughly half that of low-quality wines (0.68).
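The per-rating summaries in this section have the block layout that `by()` prints. A minimal sketch follows, using a small made-up data frame; `toy_wine` and its values are hypothetical stand-ins for `red_wine`, not rows from the dataset.

```r
# Made-up values for illustration only; not taken from the dataset
toy_wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.66, 0.50, 0.36, 0.30),
  rating = factor(c("low", "low", "average", "average", "high", "high"),
                  levels = c("low", "average", "high"))
)

# One summary() per rating level, printed as separate blocks
print(by(toy_wine$volatile.acidity, toy_wine$rating, summary))

# Medians only, which is what the boxplot comparison relies on
group_medians <- tapply(toy_wine$volatile.acidity, toy_wine$rating, median)
print(group_medians)
```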

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0800  0.1737  0.2700  1.0000 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2583  0.4000  0.7900 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

For the citric.acid feature, the opposite of volatile.acidity happens: as the citric acid concentration rises, the quality of the wine increases. The median for high-quality wines (0.40) is five times that of low-quality wines (0.08).
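The "about half" and "five times" comparisons can be checked directly from the reported medians. The vector names below are hypothetical; the numbers are copied from the summaries above.

```r
# Medians by rating, copied from the summaries above
volatile_median <- c(low = 0.68, average = 0.54, high = 0.37)
citric_median   <- c(low = 0.08, average = 0.24, high = 0.40)

# Ratio of the high-quality median to the low-quality median
volatile_ratio <- volatile_median[["high"]] / volatile_median[["low"]]
citric_ratio   <- citric_median[["high"]] / citric_median[["low"]]

print(round(c(volatile.acidity = volatile_ratio, citric.acid = citric_ratio), 2))
```

The volatile.acidity ratio is about 0.54 and the citric.acid ratio is 5.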

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   13.50   26.00   34.44   48.00  119.00 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   24.00   40.00   48.95   65.00  165.00 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.00   27.00   34.89   43.00  289.00

No clear pattern was found for the total.sulfur.dioxide feature: the medians for low- and high-quality wines are nearly identical (26.00 and 27.00).

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.380   3.384   3.500   3.900 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.210   3.310   3.311   3.400   4.010 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

For the pH feature, the boxplots show that low-quality wines tend to have a higher pH (median 3.38). However, it is not clear which pH is ideal for high-quality wines, since the medians for average- and high-quality wines are very similar (3.31 and 3.27, respectively).

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Regarding alcohol content, the boxplots show that high-quality wines contain clearly more alcohol (median 11.60%). However, alcohol does not separate low- from average-quality wines, whose medians are both 10.00%.

Now, let’s analyze the correlation between some of the features chosen.

## [1] -0.5419041

The boxplots suggested a relationship between pH and citric.acid: as pH decreases, the wine becomes more acidic and the citric.acid value increases. Wines rated ‘high’ have the lowest median pH (3.27).

The plot above shows a negative correlation of -0.5419 between the pH and citric.acid features.
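The coefficients reported in this section are the kind of single value `cor()` returns. A minimal sketch follows, using made-up vectors (`ph` and `citric` are hypothetical, not rows from the dataset) and assuming the default Pearson method was used.

```r
# Made-up measurements for illustration; citric acid falls as pH rises
ph     <- c(3.10, 3.20, 3.30, 3.40, 3.50)
citric <- c(0.50, 0.45, 0.30, 0.20, 0.05)

# Pearson correlation coefficient (the default method of cor())
r <- cor(ph, citric)

# cor.test() reports the same estimate with a confidence interval and p-value
ct <- cor.test(ph, citric)

print(round(r, 4))
print(ct$conf.int)
```

With only a coefficient in hand, `cor.test()` is a cheap way to see how wide the uncertainty around it is.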

## [1] 0.1099032

Alcohol and citric.acid both play important roles in high-quality wines; however, there is no notable relationship between the two features (a weak positive correlation of 0.1099), as shown above.

## [1] 0.2056325

Still trying to relate acidity to the alcohol level of the wine, the alcohol and pH features show a weak positive correlation of 0.2056.

## [1] -0.4961798

Turning to a different feature to understand the importance of alcohol in high-quality wines, the plot above shows the relationship between alcohol and density. They have a negative correlation of -0.4962. In other words, the higher the alcohol level, the lower the density of the wine.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The feature volatile.acidity represents the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. The boxplots show a strong relationship between this feature and the quality rating: wines with elevated volatile.acidity obtained a low rating, whereas wines with lower volatile.acidity obtained a high rating. For wines rated high, the 3rd quartile (0.4900) is lower than the median (0.5400) of wines rated average. In other words, volatile.acidity is lower for high-quality wines.

The median values presented in the boxplots of the citric.acid feature were: 0.0800 for low, 0.2400 for average, and 0.4000 for high quality rating. This means that high-quality wines have a higher concentration of citric.acid, the opposite of the pattern seen in the volatile.acidity plots.

For the total.sulfur.dioxide feature, there was no clear relationship with quality, since the boxplots show very close medians for low- and high-quality wines (26.00 and 27.00, respectively).

The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. According to the boxplots, low-quality wines have a median pH of 3.380, whereas average- and high-quality wines have medians of 3.310 and 3.270, respectively. In other words, more acidic wines tend to be rated better.

Regarding the percent alcohol content of the wine, the boxplots show that the higher the percentage of alcohol the better.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile.acidity and citric.acid presented a high negative correlation (-0.5524).

Although wines with higher alcohol content and higher acidity received the high quality classification, the relationship between these features is weak: a positive correlation of 0.2056 between alcohol and pH, and of 0.1099 between alcohol and citric.acid.

Another interesting relationship was observed between alcohol and density. They have a negative correlation of -0.4962. In other words, the higher the alcohol level, the lower the density of the wine.

What was the strongest relationship you found?

The strongest relationship was found between volatile.acidity and citric.acid, which presented a negative correlation of -0.5524.

Multivariate Plots Section

Based on the results of the previous section, when comparing citric.acid and volatile.acidity, we observed that most of the high quality wines presented high citric.acid concentration and low volatile.acidity concentration. The reverse is true for wines that have obtained low quality rating. Thus, in the plot below we put these two features together.
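A scatter plot of this kind can be sketched in base graphics by mapping each rating to a colour. The plotting library, colours, and data here are assumptions made for illustration: `toy_wine` holds made-up rows, and the original report may well have used a different plotting package.

```r
# Made-up rows for illustration only; not taken from the dataset
toy_wine <- data.frame(
  citric.acid      = c(0.02, 0.10, 0.25, 0.30, 0.45, 0.50),
  volatile.acidity = c(0.90, 0.75, 0.55, 0.50, 0.35, 0.30),
  rating = factor(c("low", "low", "average", "average", "high", "high"),
                  levels = c("low", "average", "high"))
)

# One colour per rating level (the colour choice is an assumption)
rating_colors <- c(low = "purple", average = "seagreen", high = "gold")
point_colors  <- unname(rating_colors[as.character(toy_wine$rating)])

# Draw only in an interactive session so the sketch has no side effects in scripts
if (interactive()) {
  plot(toy_wine$citric.acid, toy_wine$volatile.acidity,
       col = point_colors, pch = 19,
       xlab = "citric.acid (g / dm^3)", ylab = "volatile.acidity (g / dm^3)")
  legend("topright", legend = names(rating_colors),
         col = rating_colors, pch = 19, title = "rating")
}
```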

The pH and alcohol features were also analyzed previously. In the plot below it is possible to see how higher pH values are associated with the low quality rating of red wines.

The results showed that alcohol is an important characteristic for wine classification, so we compare this variable with others that may directly impact the high or low quality rating of a wine.

In the following plot, we observed that low-quality wines have higher density and a lower alcohol level.

For the alcohol and volatile.acidity features, it is clear that low volatile.acidity and a high alcohol level are very important for a wine to be classified as high quality.

Another important feature is citric.acid; however, when it is compared with alcohol, no striking joint pattern separates low- from high-quality wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the multivariate analysis six features were considered: alcohol, pH, volatile.acidity, citric.acid, density, and rating (categorical for quality).

When grouped together, the role of each of these chemical properties in the manufacture of high quality wines is evident:

  • High citric.acid concentration and low volatile.acidity concentration.
  • High alcohol level and low pH scale.
  • High alcohol level and low density.

Considering the important role of alcohol level, we also compared it with other features. When compared to volatile.acidity it was clear that low volatile.acidity and high alcohol level are very important to the wine classification as high quality. However, when alcohol was plotted with citric.acid, no clear relationship was observed.

Were there any interesting or surprising interactions between features?

I was surprised that there was no clear relationship between alcohol and citric.acid.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Description One

This plot is interesting because the boxplots show that the higher the percentage of alcohol the higher the quality of wine. The median alcohol level for high quality wine is 11.60 and the mean is 11.52. For the low quality wines, the 3rd Quartile was 11.00.

Plot Two

Description Two

In this plot it is possible to observe the importance of volatile.acidity and citric.acid to obtain high quality wines. Most of the high quality wines (yellow points) presented high citric.acid concentration and low volatile.acidity concentration, whereas the low quality wines (violet points) presented low citric.acid concentration and high volatile.acidity value.

Plot Three

Description Three

Results similar to those presented in the previous graphs can be observed here when we compare the level of alcohol with the volatile.acidity. A high volatile.acidity concentration is associated with low-quality wines, while a high alcohol content is associated with high-quality wines.


Reflection

The dataset analyzed contains 1,599 observations of red wines. The original dataset has 12 features: 11 chemical properties and the score given by the experts, named quality.

The quality score ranges from 0 to 10. Given the summary of this feature, we observed that most of the instances are rated 5 or 6 and only a few were rated 3 or 4, or 7 or 8. Based on that, the data was grouped into 3 categories: low (quality score less than 5), average (quality score of 5 or 6), and high (quality score of 7 or higher).

Based on an initial analysis, volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and alcohol (percent by volume) were the features considered to support the investigation, since they contribute most to the smell and taste of wine.

Based on the plots produced, it was possible to observe that not all the features presented a definitive role in the wine classification. Volatile.acidity, citric.acid, and alcohol level are the ones that stood out the most.

Considering the process itself, it was important to note that even though the dataset does not contain many features, not all of them are informative for the classification task. In addition, exploring the data through graphics is laborious but can save a lot of time during modeling.
